Food & Environment 🌿¶

Over the past few years, concerns about the environmental impact of animal agriculture have grown, making it important to understand how our food choices affect the planet. By analyzing data on the environmental impact of different foods, we can gain valuable insight into which products carry the largest footprint.

These concerns are the driving force behind this analysis. In particular, animal agriculture is a significant contributor to greenhouse gas emissions: it is a major source of methane and nitrous oxide, both powerful greenhouse gases. In addition, animal agriculture uses large amounts of water and land.

To carry out this analysis, we will use Pandas for data manipulation and analysis, Seaborn for data visualization, Matplotlib for creating custom plots, SciPy for statistical tests and analysis, and SciKit-Learn for machine learning.

This tutorial will guide users through a typical data science pipeline, while also explaining the impact of animal agriculture on the environment and why it is a pressing problem.

Data Collection¶

Initially, we must collect relevant data from the internet to conduct our analysis:

  1. Environmental Impact of Food Production
  2. Meat Consumption

We will download each respective CSV file and use Pandas to read and manipulate it in preparation for data management.

In [103]:
import pandas as pd # Import Pandas library

CSV1 = 'Food_Production.csv'  # Initialize CSV file containing environmental impact of food production data
FOOD_PRODUCTS = ["Bread", "Maize", "Barley",
                 "Oatmeal", "Rice", "Potato",
                 "Cassava", "Sugar", "Beet Sugar",
                 "Pulse", "Pea", "Nut",
                 "Groundnut", "Soymilk", "Tofu",
                 "Soybean Oil", "Palm Oil", "Sunflower Oil",
                 "Rapeseed Oil", "Olive Oil", "Tomato",
                 "Onion & Leek", "Root Veg.", "Brassica",
                 "Veg.", "Citrus Fruit", "Banana",
                 "Apple", "Berry & Grape", "Wine",
                 "Fruit", "Coffee", "Dark Chocolate",
                 "Beef", "Dairy Beef", "Lamb",
                 "Pig", "Poultry", "Milk",
                 "Cheese", "Egg", "Fish", "Shrimp"] # Create list of abbreviations for clearer graph labels (current ones are too long)

df1 = pd.read_csv(CSV1) # Read CSV file and store as a Pandas dataframe
df1.loc[:, "Food product"] = FOOD_PRODUCTS # Replace column in the dataframe called "Food product" and initialize with abbreviations
df1.head() # Print the first few entries of the dataframe
Out[103]:
Food product Land use change Animal Feed Farm Processing Transport Packging Retail Total_emissions Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal) ... Freshwater withdrawals per 100g protein (liters per 100g protein) Freshwater withdrawals per kilogram (liters per kilogram) Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal) Greenhouse gas emissions per 100g protein (kgCO₂eq per 100g protein) Land use per 1000kcal (m² per 1000kcal) Land use per kilogram (m² per kilogram) Land use per 100g protein (m² per 100g protein) Scarcity-weighted water use per kilogram (liters per kilogram) Scarcity-weighted water use per 100g protein (liters per 100g protein) Scarcity-weighted water use per 1000kcal (liters per 1000 kilocalories)
0 Bread 0.1 0.0 0.8 0.2 0.1 0.1 0.1 1.4 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Maize 0.3 0.0 0.5 0.1 0.1 0.1 0.0 1.1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Barley 0.0 0.0 0.2 0.1 0.0 0.5 0.3 1.1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Oatmeal 0.0 0.0 1.4 0.0 0.1 0.1 0.0 1.6 4.281357 ... 371.076923 482.4 0.945482 1.907692 2.897446 7.6 5.846154 18786.2 14450.92308 7162.104461
4 Rice 0.0 0.0 3.6 0.1 0.1 0.1 0.1 4.0 9.514379 ... 3166.760563 2248.4 1.207271 6.267606 0.759631 2.8 3.943662 49576.3 69825.77465 13449.891480

5 rows × 23 columns

Now that we have created our first dataframe containing the first dataset, let's create the next one.

In [104]:
CSV2 = 'meat_consumption_worldwide.csv' # Initialize CSV file containing meat consumption worldwide data
NON_COUNTRIES = ["WLD", "BRICS", "EU28", "OECD"] # Create list of non-countries to remove

df2 = pd.read_csv(CSV2) # Read CSV file and store as a Pandas dataframe
df2 = df2[~df2["LOCATION"].isin(NON_COUNTRIES)] # Remove WLD/BRICS/EU28/OECD from the dataframe, since we only want to look at individual countries
df2.head() # Print the first few entries of the dataframe
Out[104]:
LOCATION SUBJECT MEASURE TIME Value
0 AUS BEEF KG_CAP 1991 27.721815
1 AUS BEEF KG_CAP 1992 26.199591
2 AUS BEEF KG_CAP 1993 26.169094
3 AUS BEEF KG_CAP 1994 25.456134
4 AUS BEEF KG_CAP 1995 25.340226

We are done reading our data, so we can continue to the next stage where we will deal with the NaN values.

Data Management¶

To represent our data coherently in graphs, we must reshape it into new dataframes that can be easily plotted with Seaborn.

We will split df1 into two dataframes: df3 and df4. df3 will contain the first 9 columns, which are data calculated relative to each other. df4 will contain all the columns after, which are data calculated by a certain metric. This will be clearer when we graph the data.

In [105]:
END = 9 # Set the end range

df3 = df1.iloc[:, :END] # Keep the first 9 columns and store them as df3
df3.head() # Print the first few entries of the dataframe
Out[105]:
Food product Land use change Animal Feed Farm Processing Transport Packging Retail Total_emissions
0 Bread 0.1 0.0 0.8 0.2 0.1 0.1 0.1 1.4
1 Maize 0.3 0.0 0.5 0.1 0.1 0.1 0.0 1.1
2 Barley 0.0 0.0 0.2 0.1 0.0 0.5 0.3 1.1
3 Oatmeal 0.0 0.0 1.4 0.0 0.1 0.1 0.0 1.6
4 Rice 0.0 0.0 3.6 0.1 0.1 0.1 0.1 4.0
In [106]:
START = 9 # Set the start range
END = 23 # Set the end range
RANGE = [0] + list(range(START, END)) # Set the range of columns we want

df4 = df1.copy() # Create a copy to prevent warning
df4 = df4.iloc[:, RANGE] # Keep the columns after the 9th
df4.dropna(inplace = True) # Remove all NaNs since they don't provide any data
df4.reset_index(drop = True, inplace = True) # Reset our index for clarity
df4.head() # Print the first few entries of the dataframe
Out[106]:
Food product Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal) Eutrophying emissions per kilogram (gPO₄eq per kilogram) Eutrophying emissions per 100g protein (gPO₄eq per 100 grams protein) Freshwater withdrawals per 1000kcal (liters per 1000kcal) Freshwater withdrawals per 100g protein (liters per 100g protein) Freshwater withdrawals per kilogram (liters per kilogram) Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal) Greenhouse gas emissions per 100g protein (kgCO₂eq per 100g protein) Land use per 1000kcal (m² per 1000kcal) Land use per kilogram (m² per kilogram) Land use per 100g protein (m² per 100g protein) Scarcity-weighted water use per kilogram (liters per kilogram) Scarcity-weighted water use per 100g protein (liters per 100g protein) Scarcity-weighted water use per 1000kcal (liters per 1000 kilocalories)
0 Oatmeal 4.281357 11.23 8.638462 183.911552 371.076923 482.4 0.945482 1.907692 2.897446 7.60 5.846154 18786.2 14450.92308 7162.104461
1 Rice 9.514379 35.07 49.394366 609.983722 3166.760563 2248.4 1.207271 6.267606 0.759631 2.80 3.943662 49576.3 69825.77465 13449.891480
2 Potato 4.754098 3.48 20.470588 80.737705 347.647059 59.1 0.628415 2.705882 1.202186 0.88 5.176471 2754.2 16201.17647 3762.568306
3 Nut 3.113821 19.15 11.726883 672.162602 2531.414574 4133.8 0.069919 0.263319 2.107317 12.96 7.936314 229889.8 140777.58730 37380.455280
4 Groundnut 2.437931 14.14 5.401070 319.362069 707.524828 1852.3 0.556897 1.233766 1.570690 9.11 3.479756 61797.9 23605.00382 10654.810340

Next, we create two new dataframes from df3 and df4: df5 and df6. df5 will include the relative environmental impact values grouped by whether they're animal or plant foods. df6 will include the metric-based environmental impact values, grouped the same way.

This lets us compare the environmental impact of animal versus plant foods graphically, and later run a hypothesis test on whether the difference is significant.

In [107]:
PLANT_PRODUCTS = {"Bread", "Maize", "Barley",
                 "Oatmeal", "Rice", "Potato",
                 "Cassava", "Sugar", "Beet Sugar",
                 "Pulse", "Pea", "Nut",
                 "Groundnut", "Soymilk", "Tofu",
                 "Soybean Oil", "Palm Oil", "Sunflower Oil",
                 "Rapeseed Oil", "Olive Oil", "Tomato",
                 "Onion & Leek", "Root Veg.", "Brassica",
                 "Veg.", "Citrus Fruit", "Banana",
                 "Apple", "Berry & Grape", "Wine",
                 "Fruit", "Coffee", "Dark Chocolate"} # Declare a set of all plant products to categorize as "Plant"

ANIMAL_PRODUCTS = {"Beef", "Dairy Beef", "Lamb",
                   "Pig", "Poultry", "Milk",
                   "Cheese", "Egg", "Fish", "Shrimp"} # Declare a set of all animal products to categorize as "Animal"

df5 = df3.copy() # Duplicate df3 for manipulation and store it as df5
df6 = df4.copy() # Duplicate df4 for manipulation and store it as df6

df5["Food type"] = None # Create a new "Food type" column in df5 to categorize "Plant" or "Animal"
df6["Food type"] = None # Create a new "Food type" column in df6 to categorize "Plant" or "Animal"

# Categorize each row as "Plant" or "Animal" depending on which set its "Food product" belongs to
for df in (df5, df6):
  df.loc[df["Food product"].isin(PLANT_PRODUCTS), "Food type"] = "Plant"
  df.loc[df["Food product"].isin(ANIMAL_PRODUCTS), "Food type"] = "Animal"

df5 = df5.groupby("Food type").mean() # Group df5 by "Food type" and average the environmental impact values
df5["Food type"] = df5.index # Re-assign "Food type" to be a column after group by (so we can access it from its column name)
df5.reset_index(drop = True, inplace = True) # Reset our index for clarity
df5 # Print the first few entries of the dataframe
Out[107]:
Land use change Animal Feed Farm Processing Transport Packging Retail Total_emissions Food type
0 2.810000 1.95 10.490000 0.500000 0.240000 0.220000 0.180000 16.390000 Animal
1 0.790909 0.00 1.342424 0.178788 0.181818 0.284848 0.036364 2.815152 Plant

Next, we will do the same for df6. The code is separate so we can see both outputs.

In [108]:
df6 = df6.groupby("Food type").mean() # Group df6 by "Food type" and average the environmental impact values
df6["Food type"] = df6.index # Re-assign "Food type" to be a column after group by (so we can access it from its column name)
df6.reset_index(drop = True, inplace = True) # Reset our index for clarity
df6 # Print the first few entries of the dataframe
Out[108]:
Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal) Eutrophying emissions per kilogram (gPO₄eq per kilogram) Eutrophying emissions per 100g protein (gPO₄eq per 100 grams protein) Freshwater withdrawals per 1000kcal (liters per 1000kcal) Freshwater withdrawals per 100g protein (liters per 100g protein) Freshwater withdrawals per kilogram (liters per kilogram) Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal) Greenhouse gas emissions per 100g protein (kgCO₂eq per 100g protein) Land use per 1000kcal (m² per 1000kcal) Land use per kilogram (m² per kilogram) Land use per 100g protein (m² per 100g protein) Scarcity-weighted water use per kilogram (liters per kilogram) Scarcity-weighted water use per 100g protein (liters per 100g protein) Scarcity-weighted water use per 1000kcal (liters per 1000 kilocalories) Food type
0 58.085221 139.423333 73.289922 906.822192 1230.689663 2102.944444 10.436802 14.490724 34.723544 97.806667 51.590835 70855.533333 41650.170546 28049.055284 Animal
1 23.109872 20.742667 49.746990 422.777740 1728.610704 711.420000 5.741551 14.587478 5.421294 8.788667 19.385191 28073.860000 76078.624420 15746.276693 Plant

We decided to average the environmental impact values as a fair way to compare the average impact of animal versus plant foods. Summing instead would bias the comparison, since more plant products than animal products are listed.
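Averaging matters because the two groups have different sizes. A quick toy example (hypothetical numbers, not taken from the dataset) shows how a sum would flip the comparison even when the per-product impact of one group is clearly lower:

```python
import pandas as pd

# Hypothetical values: two animal products with high impact,
# seven plant products with low impact.
toy = pd.DataFrame({
    "Food type": ["Animal"] * 2 + ["Plant"] * 7,
    "Total_emissions": [10.0, 8.0] + [3.0] * 7,
})

by_sum = toy.groupby("Food type")["Total_emissions"].sum()
by_mean = toy.groupby("Food type")["Total_emissions"].mean()

# The sum makes "Plant" look worse purely because it has more rows,
# while the mean correctly shows "Animal" with the higher impact.
print(by_sum)
print(by_mean)
```

This is why the mean, not the total, is the right aggregation when group sizes differ.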

Next, we will split df2 into three dataframes: df7, df8, and df9. df7 will contain 2022 data on relative meat consumption per country. df8 will contain data on how each country's meat consumption changes every year. df9 will contain data on how all countries' meat consumption changes every year.

In [109]:
df7 = df2.loc[df2["TIME"] == 2022] # Filter to contain only 2022 data and store it as df7
df7.reset_index(drop = True, inplace = True) # Reset our index for clarity
df7.head() # Print the first few entries of the dataframe
Out[109]:
LOCATION SUBJECT MEASURE TIME Value
0 AUS BEEF KG_CAP 2022 19.951395
1 AUS PIG KG_CAP 2022 20.795147
2 AUS POULTRY KG_CAP 2022 45.974985
3 AUS SHEEP KG_CAP 2022 8.606749
4 CAN BEEF KG_CAP 2022 18.426918

We should group the data by location since we want to observe the overall meat consumption per country in our visualization.

In [110]:
df7 = df7.groupby("LOCATION").mean() # Group the data by location and average other values
df7["LOCATION"] = df7.index # Add "LOCATION" back as a column (after it's removed by groupby)
df7.reset_index(drop = True, inplace = True) # Reset the index and remove the "LOCATION" index
df7.drop(columns = ["TIME"], inplace = True) # Drop "TIME" since we're only looking at 2022
df7.head() # Print the first few entries of the dataframe
Out[110]:
Value LOCATION
0 654.438777 ARG
1 395.792480 AUS
2 92.761958 BGD
3 2742.902830 BRA
4 421.909062 CAN

Since we aim to visualize this data, we need to come up with a solid range of values to separate relative meat consumption into categories so we can clearly see which countries consume more meat than others.

Let's sort this data and view it all.

In [111]:
df7.sort_values("Value") # Sort df7 (not inplace) and print its results
Out[111]:
Value LOCATION
13 28.316855 HTI
22 40.967186 MOZ
12 48.032780 GHA
43 48.279836 ZMB
39 58.287311 URY
26 63.785447 NZL
30 72.225519 PRY
37 77.438105 TZA
11 84.938557 ETH
2 92.761958 BGD
25 97.187500 NOR
17 119.053703 ISR
9 119.396307 DZA
33 119.653849 SDN
19 138.128745 KAZ
5 149.544583 CHE
24 199.659620 NGA
6 224.346100 CHL
35 247.684181 THA
28 247.995524 PER
23 279.695257 MYS
32 286.645753 SAU
38 304.334280 UKR
10 331.591675 EGY
8 357.508500 COL
16 378.662937 IRN
36 395.481716 TUR
1 395.792480 AUS
4 421.909062 CAN
42 443.961610 ZAF
27 451.054780 PAK
20 473.225893 KOR
14 498.350871 IDN
29 524.246670 PHL
0 654.438777 ARG
15 709.425892 IND
18 720.685672 JPN
41 893.342928 VNM
21 1044.307995 MEX
31 1393.345149 RUS
34 1468.064750 SSA
3 2742.902830 BRA
40 5193.676467 USA
7 11695.816209 CHN

From this, we can intuitively suggest the following range values: 0, 100, 500, 1000, 5000, 10000, and 20000. This will be important when visualizing df7 later.
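As a sketch of how these thresholds will be used, `pd.cut` can bin a few of the sorted values above (rounded here for illustration) into the proposed bands:

```python
import pandas as pd

# The intuitive thresholds chosen from the sorted output
bins = [0, 100, 500, 1000, 5000, 10000, 20000]

# A handful of the country values from above, rounded for illustration
sample = pd.Series({"HTI": 28.3, "AUS": 395.8, "USA": 5193.7, "CHN": 11695.8})

# Assign each country to its consumption band
bands = pd.cut(sample, bins=bins)
print(bands)
```

The choropleth's `threshold_scale` later uses these same break points, so each country's map color corresponds to one of these bands.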

Next, we can create df8 and df9 accordingly.

In [112]:
df8 = df2.groupby(["LOCATION", "TIME"]).mean() # Group df2 by location and time, averaging the values
df8.reset_index(inplace = True) # Reset index for clarity (and to restore the initial columns)
df8.head() # Print the first few entries of the dataframe
Out[112]:
LOCATION TIME Value
0 ARG 1990 524.747862
1 ARG 1991 393.885581
2 ARG 1992 420.682693
3 ARG 1993 447.727484
4 ARG 1994 436.872628
In [113]:
df9 = df2.groupby("TIME").mean() # Group df2 by time only, averaging the values
df9.reset_index(inplace = True) # Reset index for clarity (and to restore the initial columns)
df9.head() # Print the first few entries of the dataframe
Out[113]:
TIME Value
0 1990 346.211080
1 1991 366.204909
2 1992 384.500097
3 1993 401.016257
4 1994 424.126743

Again, we use mean because we want to see how meat consumption varies on average.

Now that we have managed and prepared all of our dataframes, we can go into the visual representation of this data. After doing this, we can analyze what the visualization means.

Data Representation¶

For our data representation, we will first create three visualizations: a point plot of the relative environmental impact data (df3), a bar plot of the metric-based environmental impact data (df4), and a world map of meat consumption per country (df7).

This will offer a good starting point of seeing how plant vs animal products compare and which countries are doing worse than others.

In [114]:
import seaborn as sns, matplotlib.pyplot as plt # Import Seaborn and Matplotlib for visualization

sns.set(rc = {"figure.figsize": (25, 25)}) # Set figure size to be large and clear

_, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows = 2, ncols = 2) # Create subplots to place four point plots on

p1 = sns.pointplot(x = "Food product", y = "Land use change", color = "red", data = df3, ax = ax1) # Create first (red) point plot with "Food product" against "Land use change"
p1.set(xlabel = "FOOD PRODUCT", ylabel = "LAND USE CHANGE") # Give appropriate names for the axis for clearer understanding
p1.set_xticklabels(p1.get_xticklabels(), rotation = 90) # Make x axis vertical so it can fit on graph

p2 = sns.pointplot(x = "Food product", y = "Animal Feed", color = "green", data = df3, ax = ax2) # Create second (green) point plot with "Food product" against "Animal feed"
p2.set(xlabel = "FOOD PRODUCT", ylabel = "ANIMAL FEED") # Give appropriate names for the axis for clearer understanding
p2.set_xticklabels(p2.get_xticklabels(), rotation = 90) # Make x axis vertical so it can fit on graph

p3 = sns.pointplot(x = "Food product", y = "Transport", color = "blue", data = df3, ax = ax3) # Create third (blue) point plot with "Food product" against "Transport"
p3.set(xlabel = "FOOD PRODUCT", ylabel = "TRANSPORT") # Give appropriate names for the axis for clearer understanding
p3.set_xticklabels(p3.get_xticklabels(), rotation = 90) # Make x axis vertical so it can fit on graph

p4 = sns.pointplot(x = "Food product", y = "Total_emissions", color = "yellow", data = df3, ax = ax4) # Create fourth (yellow) point plot with "Food product" against "Total_emissions"
p4.set(xlabel = "FOOD PRODUCT", ylabel = "TOTAL EMISSIONS") # Give appropriate names for the axis for clearer understanding
p4.set_xticklabels(p4.get_xticklabels(), rotation = 90) # Make x axis vertical so it can fit on graph

ax1; ax2; ax3; ax4 # Print the four point plots
Out[114]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb10e07f1f0>

The graphs show the relative environmental impact of plant versus animal foods across four factors: land use change, animal feed, transport, and total emissions.

  • Land use change: The graph indicates that beef and dark chocolate production result in the highest land use change, while nuts have the lowest (in the negatives). Oil and animal byproducts tend to be moderate, at around 2.5-5.0. It is evident that beef production has a significant impact on land usage.
  • Animal feed: The graph shows that all plant products do not use animal feed, while all animal products do. All animal products seem to require a large amount of animal feed, except for milk, which is relatively low. It is clear that plant products are more efficient in this regard.
  • Transport: The graph indicates that transport emissions vary regardless of whether the product is plant or animal-derived. Sugar appears to have the highest transport emissions, followed by oils and red meat. Transportation seems to be a problem for all products, not just plant or animal-derived ones.
  • Total emissions: The graph shows that animal products have higher total emissions than plant products. While cocoa products have relatively high emissions, animal products, especially meat, have significantly higher emissions.
In [115]:
_, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows = 2, ncols = 2) # Create subplots to place four bar plots on

p1 = sns.barplot(x = "Food product", y = "Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal)", data = df4, ax = ax1) # Create first bar plot with "Food product" against "Eutrophying emissions"
p1.set(xlabel = "FOOD PRODUCT", ylabel = "EUTROPHYING EMISSIONS (gPO₄eq per 1000kcal)") # Give appropriate names for the axis for clearer understanding
p1.set_xticklabels(p1.get_xticklabels(), rotation = 90) # Make x axis vertical so it can fit on graph

p2 = sns.barplot(x = "Food product", y = "Freshwater withdrawals per 1000kcal (liters per 1000kcal)", data = df4, ax = ax2) # Create second bar plot with "Food product" against "Freshwater withdrawals"
p2.set(xlabel = "FOOD PRODUCT", ylabel = "WATER USE (liters per 1000kcal)") # Give appropriate names for the axis for clearer understanding
p2.set_xticklabels(p2.get_xticklabels(), rotation = 90) # Make x axis vertical so it can fit on graph

p3 = sns.barplot(x = "Food product", y = "Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal)", data = df4, ax = ax3) # Create third bar plot with "Food product" against "Greenhouse gas emissions"
p3.set(xlabel = "FOOD PRODUCT", ylabel = "GREENHOUSE GAS EMISSIONS (kgCO₂eq per 1000kcal)") # Give appropriate names for the axis for clearer understanding
p3.set_xticklabels(p3.get_xticklabels(), rotation = 90) # Make x axis vertical so it can fit on graph

p4 = sns.barplot(x = "Food product", y = "Land use per 1000kcal (m² per 1000kcal)", data = df4, ax = ax4) # Create fourth bar plot with "Food product" against "Land use"
p4.set(xlabel = "FOOD PRODUCT", ylabel = "LAND USE (m² per 1000kcal)") # Give appropriate names for the axis for clearer understanding
p4.set_xticklabels(p4.get_xticklabels(), rotation = 90) # Make x axis vertical so it can fit on graph

ax1; ax2; ax3; ax4 # Print the four bar plots
Out[115]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb10dcdac40>

The four bar plots show data on eutrophying emissions, water use, greenhouse gas emissions, and land use. The data is presented as a metric per 1000kcal.

  • Eutrophying emissions: Coffee has the highest eutrophying emissions, followed by beef and fish. The remaining products are similar in terms of emissions. Nuts have the lowest emissions. Overall, plant products tend to have lower emissions in this category, but there is some variation.
  • Water use: Fish and tomato use a lot of water. Other products have similar water use, with animal products possibly using slightly more on average. Cocoa products have low water use.
  • Greenhouse gas emissions: Cocoa products have high greenhouse gas emissions, followed by red meat and tomato. Other animal byproducts and plant products have much lower emissions.
  • Land use: Animal products require a lot of land, with cocoa and other animal byproducts close behind. Plant products do not require much land.
In [116]:
import folium # Import folium to create map
from branca.element import Figure # Import Figure from branca.element to fit map

fig = Figure(width = 1250, height = 750) # Create figure to fit the map on
world_map = folium.Map(location = [42.5, 0], zoom_start = 2) # Create map and focus on all relevant countries
geo_data = "https://raw.githubusercontent.com/datasets/geo-countries/master/data/countries.geojson" # Get geojson data to mark regions of countries
columns = ["LOCATION", "Value"] # Denote the columns that the "LOCATION" will be set on and the "Value" that's measured
threshold_scale = [0, 100, 500, 1000, 5000, 10000, 20000] # Provide our intuitive threshold scale here

folium.Choropleth(
    geo_data = geo_data, # Provide geo data
    data = df7, # Provide dataframe (df7)
    columns = columns, # Provide columns
    key_on = "properties.ISO_A3", # Provide metric to figure out location regions on 
    fill_color = "YlOrRd", # Provide color scheme
    legend_name = "Relative Meat Consumption", # Provide a clear legend name
    threshold_scale = threshold_scale # Provide threshold scale
).add_to(world_map) # Add a created world map to the created map

fig.add_child(world_map) # Add map onto the figure
fig # Print the figure
Out[116]:

The map shows the largest meat consumers in the world. It is clear that China and the US are the biggest consumers, followed by Russia, Mexico, and Brazil. While India is also in this category, we must consider its large population size, which means its meat consumption is relatively low in comparison. Other countries have very low meat consumption. Therefore, efforts to reduce meat consumption should be focused on the US and China.

Now that we have visualized all of our data, we can more thoroughly analyze it and gain a clearer understanding of what we want to focus on.

Data Analysis¶

In our data analysis, we will compare plants more directly with meat in terms of environmental impact and look at whether populations are increasing their meat consumption.

We will make similar point/bar plots, but instead use df5 and df6 to compare plant and animal products directly.

In [117]:
sns.set(rc = {"figure.figsize": (20, 20)}) # Set figure size to be large and clear

_, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows = 2, ncols = 2) # Create subplots to place four point plots on

p1 = sns.pointplot(x = "Food type", y = "Land use change", color = "red", data = df5, ax = ax1) # Create first (red) point plot with "Food type" against "Land use change"
p1.set(xlabel = "FOOD TYPE", ylabel = "LAND USE CHANGE") # Give appropriate names for the axis for clearer understanding
p1.set_xticklabels(p1.get_xticklabels(), rotation = 90) # Make x axis vertical so it can fit on graph

p2 = sns.pointplot(x = "Food type", y = "Animal Feed", color = "green", data = df5, ax = ax2) # Create second (green) point plot with "Food type" against "Animal Feed"
p2.set(xlabel = "FOOD TYPE", ylabel = "ANIMAL FEED") # Give appropriate names for the axis for clearer understanding
p2.set_xticklabels(p2.get_xticklabels(), rotation = 90) # Make x axis vertical so it can fit on graph

p3 = sns.pointplot(x = "Food type", y = "Transport", color = "blue", data = df5, ax = ax3) # Create third (blue) point plot with "Food type" against "Transport"
p3.set(xlabel = "FOOD TYPE", ylabel = "TRANSPORT") # Give appropriate names for the axis for clearer understanding
p3.set_xticklabels(p3.get_xticklabels(), rotation = 90) # Make x axis vertical so it can fit on graph

p4 = sns.pointplot(x = "Food type", y = "Total_emissions", color = "yellow", data = df5, ax = ax4) # Create fourth (yellow) point plot with "Food type" against "Total_emissions"
p4.set(xlabel = "FOOD TYPE", ylabel = "TOTAL EMISSIONS") # Give appropriate names for the axis for clearer understanding
p4.set_xticklabels(p4.get_xticklabels(), rotation = 90) # Make x axis vertical so it can fit on graph

ax1; ax2; ax3; ax4 # Print the four point plots
Out[117]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb10e3e4a90>

From these graphs, it's abundantly clear that plant products fare better in all four aspects (land use change, animal feed, transport, and total emissions), with the gap most pronounced in total emissions and animal feed.

In [118]:
_, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows = 2, ncols = 2) # Create subplots to place four bar plots on

p1 = sns.barplot(x = "Food type", y = "Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal)", data = df6, ax = ax1) # Create first bar plot with "Food type" against "Eutrophying emissions"
p1.set(xlabel = "FOOD TYPE", ylabel = "EUTROPHYING EMISSIONS (gPO₄eq per 1000kcal)") # Give appropriate names for the axis for clearer understanding
p1.set_xticklabels(p1.get_xticklabels(), rotation = 90) # Make x axis vertical so it can fit on graph

p2 = sns.barplot(x = "Food type", y = "Freshwater withdrawals per 1000kcal (liters per 1000kcal)", data = df6, ax = ax2) # Create second bar plot with "Food type" against "Freshwater withdrawals"
p2.set(xlabel = "FOOD TYPE", ylabel = "WATER USE (liters per 1000kcal)") # Give appropriate names for the axis for clearer understanding
p2.set_xticklabels(p2.get_xticklabels(), rotation = 90) # Make x axis vertical so it can fit on graph

p3 = sns.barplot(x = "Food type", y = "Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal)", data = df6, ax = ax3) # Create third bar plot with "Food type" against "Greenhouse gas emissions"
p3.set(xlabel = "FOOD TYPE", ylabel = "GREENHOUSE GAS EMISSIONS (kgCO₂eq per 1000kcal)") # Give appropriate names for the axis for clearer understanding
p3.set_xticklabels(p3.get_xticklabels(), rotation = 90) # Make x axis vertical so it can fit on graph

p4 = sns.barplot(x = "Food type", y = "Land use per 1000kcal (m² per 1000kcal)", data = df6, ax = ax4) # Create fourth bar plot with "Food type" against "Land use"
p4.set(xlabel = "FOOD TYPE", ylabel = "LAND USE (m² per 1000kcal)") # Give appropriate names for the axis for clearer understanding
p4.set_xticklabels(p4.get_xticklabels(), rotation = 90) # Make x axis vertical so it can fit on graph

ax1; ax2; ax3; ax4 # Print the four bar plots
Out[118]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb108166d90>

Like before, these graphs show that plant products have a lower environmental impact, with the gap in land use standing out the most.

Now that we've analyzed the environmental impact of animal vs plant foods, let's look at how meat consumption changes over the years by country. We will be focusing on df8 for when we want to look at per country data, but df9 will be there to see what the global trend looks like.

In [119]:
p = sns.scatterplot(x = "TIME", y = "Value", data = df8) # Create scatter plot with "TIME" against "Value" (meat consumption)
p.set(xlabel = "YEAR", ylabel = "MEAT CONSUMPTION") # Give appropriate names for the axis for clearer understanding
p # Print the plot
Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb1080ff760>

Here, we've created a basic scatterplot to show the general trend of meat consumption per country over the years. Although the level varies by country, there is a clear positive correlation between time and meat consumption.

In [120]:
p = sns.scatterplot(x = "TIME", y = "Value", hue = "LOCATION", data = df8) # Create scatter plot with "TIME" against "Value" (meat consumption) with a "LOCATION" hue
p.set(xlabel = "YEAR", ylabel = "MEAT CONSUMPTION") # Give appropriate names for the axis for clearer understanding
p # Print the plot
Out[120]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb1080b9e80>

When we denote countries using color, we can start to see trends by country. China, the US, and Brazil all appear to be increasing their meat consumption substantially. Other countries are also increasing their meat consumption, but not as dramatically.

In [121]:
p = sns.lmplot(x = "TIME", y = "Value", hue = "LOCATION", data = df8, height = 16) # Create linear regression plot with "TIME" against "Value" (meat consumption) with a "LOCATION" hue (set height so it fits)
p.set(xlabel = "YEAR", ylabel = "MEAT CONSUMPTION") # Give appropriate names for the axis for clearer understanding
p # Print the plot
Out[121]:
<seaborn.axisgrid.FacetGrid at 0x7fb1080ccb50>

Adding a regression line for each country makes it a lot more obvious that the majority of countries are increasing their meat consumption.

In [122]:
p = sns.lineplot(x = "TIME", y = "Value", data = df9) # Create line plot with "TIME" against "Value" (meat consumption)
p.set(xlabel = "YEAR", ylabel = "MEAT CONSUMPTION") # Give appropriate names for the axis for clearer understanding
p # Print the plot
Out[122]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb107ceda30>

When we use df9 to measure the overall trend in meat consumption, we see a clear positive correlation between time and meat consumption.
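To quantify that correlation rather than just eyeballing the line, we could compute Pearson's r between year and mean consumption. This is a minimal sketch on synthetic data with a comparable upward trend (df9 itself is not rebuilt here; with the real data you would pass `df9["TIME"]` and `df9["Value"]`):

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-in for df9: a linear upward trend plus noise,
# with slope and intercept loosely resembling the plotted data.
rng = np.random.default_rng(0)
years = np.arange(1990, 2023)
values = 350 + 13 * (years - 1990) + rng.normal(0, 20, size=years.size)

# Pearson's r measures the strength of the linear relationship;
# a small p-value indicates the correlation is unlikely to be chance.
r, p = pearsonr(years, values)
print(f"r = {r:.3f}, p = {p:.2e}")
```

An r close to 1 with a tiny p-value would confirm the "clear positive correlation" reading of the line plot.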

We will go back to using df8 here, but now we will implement a Linear Regression model over the dataset to predict how values will change over the years and whether or not there is an increase or decrease in meat consumption.

In [123]:
from sklearn.linear_model import LinearRegression # Import LinearRegression from SKLearn to predict values based on year

lr = LinearRegression() # Create LinearRegression model
lr.fit(df8[["TIME"]].values, df8[["Value"]].values) # Fit model over data ("TIME" and "Value" (meat consumption))

p = sns.scatterplot(x = "TIME", y = "Value", data = df8) # Create scatter plot with "TIME" against "Value" (meat consumption)
p.plot(df8[["TIME"]], lr.predict(df8[["TIME"]].values), color = "red") # Plot linear regression line to illustrate line of best fit
p.set(xlabel = "YEAR", ylabel = "MEAT CONSUMPTION") # Give appropriate names for the axis for clearer understanding

print(f"y = {lr.coef_[0][0]}x + {lr.intercept_[0]}") # Print the y-formula (note that plot is implicitly printed too) 
y = 13.235394229788353x + -25971.11890947386

The Linear Regression line also shows a positive correlation between time and meat consumption, as indicated by the positive slope in the equation.
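As a minimal sketch of the fit-then-predict pattern used above, here is the same idea on synthetic data (the arrays below only stand in for df8's "TIME" and "Value" columns, which aren't reproduced here): once fitted, the model can extrapolate to years outside the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for df8: yearly values that grow roughly linearly
rng = np.random.default_rng(0)
years = np.arange(1990, 2021).reshape(-1, 1)           # One feature: the year
values = 13.2 * years.ravel() - 25900 + rng.normal(0, 50, size=years.shape[0])

lr = LinearRegression()
lr.fit(years, values)  # Fit a line: value ≈ slope * year + intercept

# A positive slope indicates consumption rising over time
print(f"slope = {lr.coef_[0]:.2f}")
print(f"prediction for 2025 = {lr.predict([[2025]])[0]:.1f}")
```

The same `predict` call is what powers the 2000 vs. 2020 comparison later in this section.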

Now that we have analyzed our data, we can run hypothesis tests to confirm our assumptions.

Hypothesis Testing¶

In hypothesis testing, we will conduct two tests: a Mann-Whitney U test and a two-tailed test. The Mann-Whitney U test will check whether the two independent, non-parametric distributions of animal vs. plant environmental impact differ significantly. The two-tailed test will check whether the linear regression's predicted values for 2000 and 2020 are significantly different. If both are significant, this will support the conclusions that animal products are associated with higher environmental impact and that meat consumption is growing every year.

In [124]:
from scipy.stats import mannwhitneyu # Import mannwhitneyu from SciPy

SIZE = 8 # Set size of mean list to 8
OFFSET = 0.1 # Set offset value to prevent division by zero
DEFAULT = 1 # Set default ratio value for "Plant"
THRESHOLD = 0.05 # Set threshold for significant difference in measurement

animal_values0 = df5.iloc[0][["Land use change", "Animal Feed", "Transport", "Total_emissions"]].tolist() # Set relative animal environmental impact means as list
animal_values1 = df6.iloc[0][["Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal)", "Freshwater withdrawals per 1000kcal (liters per 1000kcal)", "Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal)", "Land use per 1000kcal (m² per 1000kcal)"]].tolist() # Set metric-based animal environmental impact means as list
animal_values = animal_values0 + animal_values1 # Put lists together

plant_values0 = df5.iloc[1][["Land use change", "Animal Feed", "Transport", "Total_emissions"]].tolist() # Set relative plant environmental impact means as list
plant_values1 = df6.iloc[1][["Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal)", "Freshwater withdrawals per 1000kcal (liters per 1000kcal)", "Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal)", "Land use per 1000kcal (m² per 1000kcal)"]].tolist() # Set metric-based plant environmental impact means as list
plant_values = plant_values0 + plant_values1 # Put lists together

# Adjust the values to be the ratio of animal vs plant (plant values set to 1 and animal values set to animal value / plant value to get a fair ratio)
for i in range(SIZE): 
  animal_values[i] = (animal_values[i] + OFFSET) / (plant_values[i] + OFFSET)
  plant_values[i] = DEFAULT

_, p = mannwhitneyu(animal_values, plant_values) # Run hypothesis test on values

# Display if means are significantly different or not based on threshold
if p < THRESHOLD:
    print("Alternative Hypothesis: The means are significantly different.")
else:
    print("Null Hypothesis: The means are not significantly different.")
Alternative Hypothesis: The means are significantly different.

As can be seen, the means are significantly different. Since we are aggregating values measured in different metrics, it was best to convert the animal and plant values into ratios in order to run a fair hypothesis test on the results.

In [125]:
pred_2000 = lr.predict([[2000]])[0][0] # Predict value for 2000
pred_2020 = lr.predict([[2020]])[0][0] # Predict value for 2020

diff = abs(pred_2000 - pred_2020) # Get absolute value of difference of predicted values
std = df8["Value"].std() # Get standard deviation of values

# Display if difference are significant or not based on standard deviation and threshold
if diff > std * THRESHOLD:
    print("Alternative Hypothesis: The difference is significant.")
else:
    print("Null Hypothesis: The difference is not significant.")
Alternative Hypothesis: The difference is significant.

We wanted to compare 2000 and 2020 to see whether meat consumption increased. We ran a two-tailed test on the difference, and the alternative hypothesis held: there was a significant difference. So, meat consumption did increase.

With the technical work complete, we can communicate our overall insights.

Insights¶

Technical Decisions & External Resources¶

Before I go through the insights of our analysis, we should take a moment to consider how we approached this tutorial. This guide, although its objective is to show the negative impact of animal agriculture, also serves as a data science tutorial. While it is somewhat difficult to detail everything step-by-step, I do my best by explaining the decisions I've made throughout and showing what the code looks like in a simple way. I have also used repetition to reinforce how to tackle certain technical problems.

With that said, I will discuss the libraries I've used and general approaches to using them effectively.

Pandas¶

When using Pandas, I recommend reading the docs linked above, but the cheat sheet is very helpful when you want a quick reference. As in this tutorial, you typically start by extracting data from a CSV (or another file format) into a dataframe containing the data relevant to your analysis. From there, you can manipulate the dataframe according to your needs using the cheat sheet and call the head() function to inspect the result.
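The extract–manipulate–inspect loop described above can be sketched in a few lines (the tiny in-memory CSV here is illustrative, standing in for a downloaded file like Food_Production.csv):

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a downloaded file
csv_text = """Food,Total_emissions
Beef,59.6
Tofu,3.2
Rice,4.5
"""

df = pd.read_csv(io.StringIO(csv_text))                  # 1. Extract data into a dataframe
df = df.sort_values("Total_emissions", ascending=False)  # 2. Manipulate it to your needs
print(df.head())                                         # 3. Inspect the result
```

With a real file, the `io.StringIO` wrapper is simply replaced by the file path.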

Seaborn & Matplotlib¶

I decided to use Seaborn along with Matplotlib instead of using Matplotlib by itself, because Seaborn offers beautiful, simple visualization techniques; beauty can be an effective element in conveying a point, as it attracts attention. When using Seaborn with Matplotlib, you can create subplots as I did. It's also worth mentioning that Seaborn follows a fairly consistent graphing pattern: usually just a call to sns.someplot(x = ?, y = ?, data = ?). Then, you can label and manipulate the axes as I did. I recommend checking out the documentation I provided above.
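That recurring `sns.someplot(x = ?, y = ?, data = ?)` pattern, followed by relabeling the axes, looks like this in miniature (the toy dataframe mimics the shape of df8 but is not the tutorial's data):

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend so no display window is needed
import pandas as pd
import seaborn as sns

# Toy data in the same shape as the tutorial's df8
df = pd.DataFrame({
    "TIME": [2000, 2005, 2010, 2000, 2005, 2010],
    "Value": [20.0, 25.0, 31.0, 40.0, 44.0, 50.0],
    "LOCATION": ["A", "A", "A", "B", "B", "B"],
})

# The consistent Seaborn pattern: someplot(x = ?, y = ?, data = ?)
p = sns.scatterplot(x="TIME", y="Value", hue="LOCATION", data=df)
p.set(xlabel="YEAR", ylabel="MEAT CONSUMPTION")  # Relabel the axes
```

Swapping `scatterplot` for `lineplot` or `lmplot` keeps the same call shape, which is what makes the library quick to pick up.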

SKLearn¶

With SKLearn, there are multiple supervised and unsupervised learning algorithms you can use. Just as I used linear regression, you can apply a variety of other algorithms such as K-nearest neighbors, random forests, and decision trees. Moreover, you can run K-fold cross-validation on them to evaluate models appropriately. I recommend checking the documentation I provided above.
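As a hedged sketch of the K-fold idea mentioned above (the synthetic yearly data and the pair of models here are illustrative, not from the tutorial), `cross_val_score` lets you compare estimators without carving out a manual test split:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic yearly data with a roughly linear trend plus noise
rng = np.random.default_rng(1)
X = np.arange(1990, 2020, dtype=float).reshape(-1, 1)
y = 13.0 * X.ravel() - 25800 + rng.normal(0, 30, size=X.shape[0])

# Shuffled 5-fold cross-validation: each fold serves once as held-out data
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    results[type(model).__name__] = scores.mean()
    print(type(model).__name__, round(results[type(model).__name__], 3))
```

The same loop works for KNeighborsRegressor, RandomForestRegressor, and so on, since SKLearn estimators share a common fit/predict interface.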

SciPy¶

With SciPy, you can import a variety of hypothesis tests (T-test, U-test, etc.) and run tests on data similarly to what I did above. You can decide on a hypothesis test based on this flowchart. I recommend checking out the documentation I provided above.
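The import-and-call pattern is the same across SciPy's tests. As a small sketch on made-up numbers (the two samples below are illustrative, not the tutorial's values), a parametric T-test and the non-parametric U-test used earlier sit side by side:

```python
from scipy.stats import ttest_ind, mannwhitneyu

# Two small samples with clearly different centers (illustrative values)
animal = [36.4, 12.1, 24.5, 21.1, 9.9]
plant = [1.4, 0.9, 3.2, 0.7, 1.6]

# Parametric two-sample T-test (assumes roughly normal data)
_, p_t = ttest_ind(animal, plant)

# Non-parametric Mann-Whitney U test (no normality assumption)
_, p_u = mannwhitneyu(animal, plant)

# Compare each p-value against the usual 0.05 threshold
print(p_t < 0.05, p_u < 0.05)
```

Which test is appropriate depends on the shape of your data, which is exactly what the flowchart mentioned above helps you decide.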

Animal Agriculture¶

From our data and observations, we can conclude that animal agriculture is linked to greater negative environmental impact and that countries, in particular the US and China, are unfortunately increasing their meat consumption. As a society, if we want to preserve our planet, we should aim to consume more whole plant-based foods as opposed to animal foods.